Abstract

We describe and analyze efficient algorithms 
for learning a linear predictor from examples 
when the learner can only view a few at- 
tributes of each training example. This is the 
case, for instance, in medical research, where 
each patient participating in the experiment 
is only willing to go through a small number 
of tests. Our analysis bounds the number 
of additional examples sufficient to compen- 
sate for the lack of full information on each 
training example. We demonstrate the ef- 
ficiency of our algorithms by showing that 
when running on digit recognition data, they 
obtain a high prediction accuracy even when 
the learner gets to see only four pixels of each 
image. 

1. Introduction 

Suppose we would like to predict if a person has some 
disease based on medical tests. Theoretically, we may 
choose a sample of the population, perform a large 
number of medical tests on each person in the sample 
and learn from this information. In many situations 
this is unrealistic, since patients participating in the 
experiment are not willing to go through a large num- 
ber of medical tests. The above example motivates the 
problem studied in this paper, that is learning when 
there is a hard constraint on the number of attributes 
the learner may view for each training example. 

We propose an efficient algorithm for dealing with this
partial information problem, and bound the number
of additional training examples sufficient to compensate
for the lack of full information on each training
example. Roughly speaking, we actively pick which
attributes to observe in a randomized way so as to construct
a "noisy" version of all attributes. Intuitively,
we can still learn despite the error of this estimate
because instead of receiving the exact value of each
individual example in a small set it suffices to get noisy
estimations of many examples.

Appearing in Proceedings of the 27th International Conference
on Machine Learning, Haifa, Israel, 2010. Copyright
2010 by the author(s)/owner(s).

CESA-BIANCHI@DSI.UNIMI.IT
SHAIS@CS.HUJI.AC.IL
OHADSH@CS.HUJI.AC.IL

1.1. Related Work 

Many methods have been proposed for dealing with 
missing or partial information. Most of the approaches 
do not come with formal guarantees on the risk of 
the resulting algorithm, and are not guaranteed to 
converge in polynomial time. The difficulty stems 
from the exponential number of ways to complete 
the missing information. In the framework of gener- 
ative models, a popular approach is the Expectation- 
Maximization (EM) procedure (Dempster et al., 1977). 
The main drawback of the EM approach is that it 
might find sub-optimal solutions. In contrast, the 
methods we propose in this paper are provably efficient
and come with finite sample guarantees on the
risk. 

Our technique for dealing with missing information 
borrows ideas from algorithms for the adversarial 
multi-armed bandit problem (Auer et al., 2003;
Cesa-Bianchi and Lugosi, 2006). Our learning algorithms
actively choose which attributes to observe for each ex- 
ample. This and similar protocols were studied in the 
context of active learning (Cohn et al., 1994; Balcan 
et al., 2006; Hanneke, 2007; 2009; Beygelzimer et al., 
2009), where the learner can ask for the target associ- 
ated with specific examples. 

The specific learning task we consider in the paper was 
first proposed in (Ben-David and Dichterman, 1998), 
where it is called "learning with restricted focus of 
attention". Ben-David and Dichterman (1998) consid- 
ered the classification setting and showed learnability 
of several hypothesis classes in this model, like k-DNF
and axis-aligned rectangles. However, to the best of 
our knowledge, no efficient algorithm for the class of 
linear predictors has been proposed. 1 

A related setting, called budgeted learning, was re- 
cently studied - see for example (Deng et al., 2007; 
Kapoor and Greiner, 2005) and the references therein. 
In budgeted learning, the learner purchases attributes 
at some fixed cost subject to an overall budget. Be- 
sides lacking formal guarantees, this setting is different 
from the one we consider in this paper, because we im- 
pose a budget constraint on the number of attributes 
that can be obtained for every individual example, as 
opposed to a global budget. In some applications, such
as the medical application discussed previously, our
constraint leads to a more realistic data acquisition
process: a global budget allows the learner to ask for
many attributes of some individual patients, while our
protocol guarantees a constant number of medical tests
for all the patients.

Our technique is reminiscent of methods used in the 
compressed learning framework (Calderbank et al., 
2009; Zhou et al., 2009), where data is accessed via a 
small set of random linear measurements. Unlike com- 
pressed learning, where learners are both trained and 
evaluated in the compressed domain, our techniques 
are mainly designed for a scenario in which only the 
access to training data is restricted. 

The "opposite" setting, in which full information is 
given at training time and the goal is to train a predic- 
tor that depends only on a small number of attributes 
at test time, was studied in the context of learning 
sparse predictors - see for example (Tibshirani, 1996) 
and the wide literature on sparsity properties of $\ell_1$
regularization. Since our algorithms also enforce low
$\ell_1$ norm, many of those results can be combined with
our techniques to yield an algorithm that views only
O(1) attributes at training time, and a number of attributes
comparable to the achievable sparsity at test
time. Since our focus in this work is on constrained 
information at training time, we do not elaborate on 
this subject. Furthermore, in some real-world situations,
it is reasonable to assume that attributes are
very expensive at training time but are easier to
obtain at test time. Returning to the example of medical
applications, it is unrealistic to convince patients
to participate in a medical experiment in which they
need to go through a lot of medical tests, but once the
system is trained, at testing time, patients who need
the prediction of the system will agree to perform as
many medical tests as needed.

1 Ben-David and Dichterman (1998) do describe learnability
results for similar classes but only under the restricted
family of product distributions.

A variant of the above setting is the one studied by 
Greiner et al. (2002), where the learner has all the information
at training time and at test time he tries to
actively choose a small number of attributes to form a
prediction. Note that active learning at training time,
as we do here, may give more learning power than ac- 
tive learning at testing time. For example, we formally 
prove that while it is possible to learn a consistent pre- 
dictor accessing at most 2 attributes of each example 
at training time, it is not possible (even with an infi- 
nite amount of training examples) to build an active 
classifier that uses at most 2 attributes of each exam- 
ple at test time, and whose error will be smaller than 
a constant. 

2. Main Results 

In this section we outline the main results. We start
with a formal description of the learning problem. In
linear regression each example is an instance-target
pair, $(\mathbf{x}, y) \in \mathbb{R}^d \times \mathbb{R}$. We refer to $\mathbf{x}$ as a vector of
attributes and the goal of the learner is to find a linear
predictor $\mathbf{x} \mapsto \langle \mathbf{w}, \mathbf{x} \rangle$, where we refer to $\mathbf{w} \in \mathbb{R}^d$ as
the predictor. The performance of a predictor $\mathbf{w}$ on an
instance-target pair, $(\mathbf{x}, y) \in \mathbb{R}^d \times \mathbb{R}$, is measured by a
loss function $\ell(\langle \mathbf{w}, \mathbf{x} \rangle, y)$. For simplicity, we focus on
the squared loss function, $\ell(a, b) = (a - b)^2$, and briefly
discuss other loss functions in Section 5. Following the
standard framework of statistical learning (Haussler,
1992; Devroye et al., 1996; Vapnik, 1998), we model
the environment as a joint distribution $\mathcal{D}$ over the
set of instance-target pairs, $\mathbb{R}^d \times \mathbb{R}$. The goal of the
learner is to find a predictor with low risk, defined as
the expected loss:

$$ L_{\mathcal{D}}(\mathbf{w}) \stackrel{\text{def}}{=} \mathbb{E}_{(\mathbf{x},y) \sim \mathcal{D}}\left[ \ell(\langle \mathbf{w}, \mathbf{x} \rangle, y) \right]. $$

Since the distribution $\mathcal{D}$ is unknown to the learner,
he learns by relying on a training set of m examples
$S = (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)$, which are assumed to be
sampled i.i.d. from $\mathcal{D}$. We denote the training loss by

$$ L_S(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} (\langle \mathbf{w}, \mathbf{x}_i \rangle - y_i)^2. $$

We now distinguish between two scenarios:

• Full information: The learner receives the en- 
tire training set. This is the traditional linear re- 
gression setting. 

• Partial information: For each individual example,
$(\mathbf{x}_i, y_i)$, the learner receives the target $y_i$ but
is only allowed to see k attributes of $\mathbf{x}_i$, where k is
a parameter of the problem. The learner has the
freedom to actively choose which of the attributes
will be revealed, as long as at most k of them will
be given.
While the full information case was extensively studied,
the partial information case is more challenging.
Our approach for dealing with the problem of partial
information is to rely on algorithms for the full information
case and to fill in the missing information in a
randomized, data- and algorithm-dependent way. As
a simple baseline, we begin by describing a straight- 
forward adaptation of Lasso (Tibshirani, 1996), based 
on a direct nonadaptive estimate of the loss function. 
We then turn to describe a more effective approach, 
which combines a stochastic gradient descent algo- 
rithm called Pegasos (Shalev-Shwartz et al., 2007) with 
the active sampling of attributes in order to estimate 
the gradient of the loss at each step. 

2.1. Baseline Algorithm 

A popular approach for learning a linear regressor is 
to minimize the empirical loss on the training set plus 
a regularization term taking the form of a norm of the 
predictor. For example, in ridge regression the regularization
term is $\|\mathbf{w}\|_2^2$ and in Lasso the regularization
term is $\|\mathbf{w}\|_1$. Instead of regularization, we can include
a constraint of the form $\|\mathbf{w}\|_1 \le B$ or $\|\mathbf{w}\|_2 \le B$. With
an adequate tuning of parameters, the regularization
form is equivalent to the constraint form. In the constraint
form, the predictor is a solution to the following
optimization problem:

$$ \min_{\mathbf{w}} \frac{1}{|S|} \sum_{(\mathbf{x},y) \in S} (\langle \mathbf{w}, \mathbf{x} \rangle - y)^2 \quad \text{s.t.} \quad \|\mathbf{w}\|_p \le B, \qquad (1) $$

where $S = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)\}$ is a training set of
examples, B is a regularization parameter, and p is
1 for Lasso and 2 for ridge regression. Standard risk
bounds for Lasso imply that if $\hat{\mathbf{w}}$ is a minimizer of (1)
(with p = 1), then with probability greater than $1 - \delta$
over the choice of a training set of size m we have

$$ L_{\mathcal{D}}(\hat{\mathbf{w}}) \le \min_{\mathbf{w} : \|\mathbf{w}\|_1 \le B} L_{\mathcal{D}}(\mathbf{w}) + O\left( B^2 \sqrt{\frac{\ln(d/\delta)}{m}} \right). \qquad (2) $$

To adapt Lasso to the partial information case, we first 
rewrite the squared loss as follows: 

$$ (\langle \mathbf{w}, \mathbf{x} \rangle - y)^2 = \mathbf{w}^T (\mathbf{x}\mathbf{x}^T) \mathbf{w} - 2 y\, \mathbf{x}^T \mathbf{w} + y^2, $$

where $\mathbf{w}, \mathbf{x}$ are column vectors and $\mathbf{w}^T, \mathbf{x}^T$ are their
corresponding transposes (i.e., row vectors). Next, we
estimate the matrix $\mathbf{x}\mathbf{x}^T$ and the vector $\mathbf{x}$ using the
partial information we have, and then we solve the
optimization problem given in (1) with the estimated
values of $\mathbf{x}\mathbf{x}^T$ and $\mathbf{x}$. To estimate the vector $\mathbf{x}$ we
can pick an index i uniformly at random from $[d] = \{1, \ldots, d\}$
and define the estimation to be a vector $\mathbf{v}$ such that

$$ v_r = \begin{cases} d\, x_r & \text{if } r = i \\ 0 & \text{else} \end{cases} \qquad (3) $$

It is easy to verify that $\mathbf{v}$ is an unbiased estimate of $\mathbf{x}$,
namely, $\mathbb{E}[\mathbf{v}] = \mathbf{x}$, where the expectation is with respect to
the choice of the index i. When we are allowed to see
k > 1 attributes, we simply repeat the above process
(without replacement) and set $\mathbf{v}$ to be the averaged
vector. To estimate the matrix $\mathbf{x}\mathbf{x}^T$ we could pick two
indices i, j independently and uniformly at random
from [d], and define the estimation to be a matrix with
all zeros except $d^2 x_i x_j$ in the $(i, j)$ entry. However,
this yields a non-symmetric matrix, which would make
our optimization problem with the estimated matrix
non-convex. To overcome this obstacle, we symmetrize
the matrix by adding its transpose and dividing by
2. The resulting baseline procedure$^2$ is given in Algorithm 1.
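The averaged estimate of $\mathbf{x}$ from k attributes can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `estimate_attributes` and `exact_mean_estimate` are our own, and the second function checks unbiasedness by enumerating every size-k index subset rather than by sampling.

```python
from itertools import combinations
import random


def estimate_attributes(x, k, rng):
    """Return an estimate of x built from k revealed attributes.

    The k indices are drawn without replacement. Averaging the k
    single-index estimates d * x[i] * e_i gives (d / k) * x[i] on the
    sampled coordinates and 0 elsewhere.
    """
    d = len(x)
    v = [0.0] * d
    for i in rng.sample(range(d), k):
        v[i] = (d / k) * x[i]
    return v


def exact_mean_estimate(x, k):
    """Average the estimate over every size-k index subset (exact expectation)."""
    d = len(x)
    subsets = list(combinations(range(d), k))
    mean = [0.0] * d
    for subset in subsets:
        for i in subset:
            mean[i] += (d / k) * x[i] / len(subsets)
    return mean


x = [0.5, -1.0, 0.25, 0.0, 1.0]
v = estimate_attributes(x, k=2, rng=random.Random(0))  # one noisy draw
mean_v = exact_mean_estimate(x, k=2)                   # matches x up to rounding
```

A single draw `v` is far from `x`, since all but k coordinates are zero; only the average over the random choice of indices recovers `x`.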



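Algorithm 1 Baseline(S, k)

  S: full information training set with m examples
  k: can view only k elements of each instance in S
  Parameter: B

  Initialize: $\bar{A} = 0 \in \mathbb{R}^{d,d}$ ; $\bar{\mathbf{v}} = 0 \in \mathbb{R}^d$ ; $\bar{y} = 0$
  for each $(\mathbf{x}, y) \in S$
    $\mathbf{v} = 0 \in \mathbb{R}^d$
    $A = 0 \in \mathbb{R}^{d,d}$
    Choose C uniformly at random from
      all subsets of $[d] \times [d]$ of size k/2
    for each $(i, j) \in C$
      $v_i = v_i + (d/k)\, x_i$
      $v_j = v_j + (d/k)\, x_j$
      $A_{i,j} = A_{i,j} + (d^2/k)\, x_i x_j$
      $A_{j,i} = A_{j,i} + (d^2/k)\, x_i x_j$
    end
    $\bar{A} = \bar{A} + A/m$
    $\bar{\mathbf{v}} = \bar{\mathbf{v}} + 2 y \mathbf{v}/m$
    $\bar{y} = \bar{y} + y^2/m$
  end
  Let $\tilde{L}_S(\mathbf{w}) = \mathbf{w}^T \bar{A} \mathbf{w} - \mathbf{w}^T \bar{\mathbf{v}} + \bar{y}$
  Output: solution of $\min_{\mathbf{w} : \|\mathbf{w}\|_1 \le B} \tilde{L}_S(\mathbf{w})$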

2 We note that an even simpler approach is to arbitrarily
assume that the correlation matrix is the identity matrix,
and then the solution to the loss minimization problem
is simply the averaged vector, $\mathbf{w} = \frac{1}{|S|} \sum_{(\mathbf{x},y) \in S} y\, \mathbf{x}$. In that
case, we can simply replace $\mathbf{x}$ by its estimated vector as defined
in (3). While this naive approach can work on very
simple classification tasks, it will perform poorly on realistic
data sets, in which the correlation matrix is not likely
to be the identity. Indeed, in our experiments with the MNIST
data set, we found that this approach performed poorly
relative to the algorithms proposed in this paper.
The following theorem shows that similar to Lasso,
the Baseline algorithm is competitive with the optimal
linear predictor with a bounded $\ell_1$ norm.

Theorem 1 Let $\mathcal{D}$ be a distribution such that
$P[\mathbf{x} \in [-1,+1]^d \wedge y \in [-1,+1]] = 1$. Let $\hat{\mathbf{w}}$ be the output of
Baseline(S, k), where |S| = m. Then, with probability
of at least $1 - \delta$ over the choice of the training set and
the algorithm's own randomization we have

$$ L_{\mathcal{D}}(\hat{\mathbf{w}}) \le \min_{\mathbf{w} : \|\mathbf{w}\|_1 \le B} L_{\mathcal{D}}(\mathbf{w}) + O\left( \frac{(dB)^2}{k} \sqrt{\frac{\ln(d/\delta)}{m}} \right). $$

The above theorem tells us that for a sufficiently large 
training set we can find a very good predictor. Put 
another way, a large number of examples can compen- 
sate for the lack of full information on each individual 
example. In particular, to overcome the extra factor
$d^2/k$ in the bound, which does not appear in the full
information bound given in (2), we need to increase m
by a factor of $d^4/k^2$.

Note that when k = d we do not recover the full information
bound. This is because we try to estimate a
matrix with $d^2$ entries using only $k = d < d^2$ samples.
In the next subsection, we describe a better, adaptive 
procedure for the partial information case. 
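The factor $d^4/k^2$ follows from equating the Baseline bound with the full information bound (2). A short worked derivation, where $m'$ is our own notation for the partial information sample size and the constants hidden by $O(\cdot)$ are assumed equal:

```latex
\frac{d^2 B^2}{k}\sqrt{\frac{\ln(d/\delta)}{m'}}
  \;=\; B^2\sqrt{\frac{\ln(d/\delta)}{m}}
\quad\Longrightarrow\quad
\sqrt{\frac{m'}{m}} \;=\; \frac{d^2}{k}
\quad\Longrightarrow\quad
m' \;=\; \frac{d^4}{k^2}\, m .
```

The square appears because both bounds decay as $1/\sqrt{m}$, so a multiplicative factor in the bound costs its square in examples.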

2.2. Gradient-based Attribute Efficient 
Regression 

In this section, by avoiding the estimation of the matrix
$\mathbf{x}\mathbf{x}^T$, we significantly decrease the number of additional
examples sufficient for learning with k attributes
per training example. To do so, we do not try to estimate
the loss function but rather estimate the gradient
$\nabla \ell(\mathbf{w}) = 2(\langle \mathbf{w}, \mathbf{x} \rangle - y)\, \mathbf{x}$, with respect to $\mathbf{w}$, of the
squared loss function $(\langle \mathbf{w}, \mathbf{x} \rangle - y)^2$. Each vector $\mathbf{w}$
can define a probability distribution over [d] by letting
$P[i] = |w_i| / \|\mathbf{w}\|_1$. We can estimate the gradient using
2 attributes as follows. First, we randomly pick j from
[d] according to the distribution defined by $\mathbf{w}$. Using
j we estimate the term $\langle \mathbf{w}, \mathbf{x} \rangle$ by $\mathrm{sgn}(w_j)\, \|\mathbf{w}\|_1\, x_j$. It
is easy to verify that the expectation of the estimate
equals $\langle \mathbf{w}, \mathbf{x} \rangle$. Second, we randomly pick i from [d]
according to the uniform distribution over [d]. Based
on i, we estimate the vector $\mathbf{x}$ as in (3). Overall, we
obtain the following unbiased estimation of the gradient:

$$ \tilde{\nabla} \ell(\mathbf{w}) = 2\left( \mathrm{sgn}(w_j)\, \|\mathbf{w}\|_1\, x_j - y \right) \mathbf{v}, \qquad (4) $$

where $\mathbf{v}$ is as defined in (3).
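The two-attribute estimate (4) can be sketched as follows. This is a minimal illustration, not the authors' code: the names `gradient_estimate` and `expected_gradient_estimate` are ours, and the case $\mathbf{w} = 0$ (where the sampling distribution is undefined) is handled by our own assumption of estimating $\langle \mathbf{w}, \mathbf{x} \rangle$ as 0. The second function computes the exact expectation over the independent choices of i and j, so unbiasedness can be checked without sampling noise.

```python
import random


def sign(a):
    return (a > 0) - (a < 0)


def gradient_estimate(w, x, y, rng):
    """One draw of estimate (4); reads at most two coordinates of x."""
    d = len(x)
    l1 = sum(abs(a) for a in w)
    if l1 > 0:
        # j ~ P[j] = |w_j| / ||w||_1
        j = rng.choices(range(d), weights=[abs(a) for a in w])[0]
        inner = sign(w[j]) * l1 * x[j]
    else:
        inner = 0.0  # assumption: <w, x> = 0 exactly when w = 0
    i = rng.randrange(d)  # uniform index, as in (3)
    g = [0.0] * d
    g[i] = 2.0 * (inner - y) * d * x[i]
    return g


def expected_gradient_estimate(w, x, y):
    """Exact expectation of the estimate over i and j (requires w != 0)."""
    d = len(x)
    l1 = sum(abs(a) for a in w)
    g = [0.0] * d
    for j in range(d):
        p_j = abs(w[j]) / l1
        inner = sign(w[j]) * l1 * x[j]
        for i in range(d):
            g[i] += p_j * (1.0 / d) * 2.0 * (inner - y) * d * x[i]
    return g


w = [0.5, -1.0, 0.0]
x = [0.2, -0.4, 0.9]
y = 0.3
true_grad = [2.0 * (sum(a * b for a, b in zip(w, x)) - y) * xi for xi in x]
mean_grad = expected_gradient_estimate(w, x, y)
one_draw = gradient_estimate(w, x, y, random.Random(0))
```

Unbiasedness relies on i and j being drawn independently, so that the expectation of the product factors into the product of expectations.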

The advantage of the above approach over the loss
based approach we took before is that the magnitude
of each element of the gradient estimate is order of
$d\, \|\mathbf{w}\|_1$. This is in contrast to what we had for the loss
based approach, where the magnitude of each element
of the matrix A was order of $d^2$. In many situations,
the $\ell_1$ norm of a good predictor is significantly smaller
than d and in these cases the gradient based estimate
is better than the loss based estimate. However, while 
in the previous approach our estimation did not de- 
pend on a specific w, now the estimation depends on 
w. We therefore need an iterative learning method 
in which at each iteration we use the gradient of the 
loss function on an individual example. Luckily, the 
stochastic gradient descent approach conveniently fits 
our needs. 

Concretely, below we describe a variant of the Pegasos 
algorithm (Shalev-Shwartz et al., 2007) for learning 
linear regressors. Pegasos tries to minimize the regu- 
larized risk 

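$$ \min_{\mathbf{w}} \; \mathbb{E}_{(\mathbf{x},y) \sim \mathcal{D}}\left[ (\langle \mathbf{w}, \mathbf{x} \rangle - y)^2 \right] + \lambda \|\mathbf{w}\|_2^2. \qquad (5) $$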

Of course, the distribution $\mathcal{D}$ is unknown, and therefore
we cannot hope to solve the above problem exactly.
Instead, Pegasos finds a sequence of weight vectors
that (on average) converge to the solution of (5).
We start with the all zeros vector $\mathbf{w} = 0 \in \mathbb{R}^d$. Then,
at each iteration Pegasos picks the next example in the
training set (which is equivalent to sampling a fresh example
according to $\mathcal{D}$) and calculates the gradient of
the loss function on this example with respect to the
current weight vector $\mathbf{w}$. In our case, the gradient is
simply $2(\langle \mathbf{w}, \mathbf{x} \rangle - y)\, \mathbf{x}$. We denote this gradient vector
by $\nabla$. Finally, Pegasos updates the predictor according
to the rule: $\mathbf{w} = (1 - \frac{1}{t})\, \mathbf{w} - \frac{1}{\lambda t} \nabla$, where t is the
current iteration number.

To apply Pegasos in the partial information case we
could simply replace the gradient vector $\nabla$ with its
estimation given in (4). However, our analysis shows
that it is desirable to maintain an estimation vector
$\tilde{\nabla}$ with small magnitude. Since the magnitude of $\tilde{\nabla}$ is
order of $d\, \|\mathbf{w}\|_1$, where $\mathbf{w}$ is the current weight vector
maintained by the algorithm, we would like to ensure
that $\|\mathbf{w}\|_1$ is always smaller than some threshold B.
We achieve this goal by adding an additional projection
step at the end of each Pegasos iteration. Formally,
after performing the update we set

$$ \mathbf{w} \leftarrow \operatorname*{argmin}_{\mathbf{u} : \|\mathbf{u}\|_1 \le B} \|\mathbf{u} - \mathbf{w}\|_2. \qquad (6) $$

This projection step can be performed efficiently in
time O(d) using the technique described in (Duchi
et al., 2008). A pseudo-code of the resulting Attribute
Efficient Regression algorithm is given in Algorithm 2.
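The projection (6) can be sketched with the simpler sort-based variant of the Duchi et al. (2008) procedure. Note the swap: this version runs in O(d log d) time, not the O(d) expected time of the randomized method referred to above. The function name `project_l1_ball` is our own, and a strictly positive radius is assumed.

```python
def sign(a):
    return (a > 0) - (a < 0)


def project_l1_ball(w, radius):
    """Euclidean projection of w onto {u : ||u||_1 <= radius}, radius > 0."""
    if sum(abs(a) for a in w) <= radius:
        return list(w)  # already feasible, nothing to do
    # Sort magnitudes in decreasing order and find the soft threshold theta.
    u = sorted((abs(a) for a in w), reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        candidate = (cumulative - radius) / j
        if uj - candidate > 0:
            theta = candidate  # the last j passing this test gives the threshold
    # Shrink every coordinate towards zero by theta, clipping at zero.
    return [sign(a) * max(abs(a) - theta, 0.0) for a in w]


projected = project_l1_ball([3.0, 1.0], 2.0)  # [2.0, 0.0]
```

The projection is a soft-thresholding operation, which is why it tends to zero out small coordinates of the predictor.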
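Algorithm 2 AER(S, k)

  S: full information training set with m examples
  k: access only k elements of each instance in S
  Parameters: $\lambda$, B

  $\mathbf{w} = (0, \ldots, 0)$ ; $\bar{\mathbf{w}} = \mathbf{w}$ ; t = 1
  for each $(\mathbf{x}, y) \in S$
    $\mathbf{v} = 0 \in \mathbb{R}^d$
    Choose C uniformly at random from
      all subsets of [d] of size k/2
    for each $j \in C$
      $v_j = v_j + \frac{2}{k}\, d\, x_j$
    end
    $\hat{y} = 0$
    for r = 1, ..., k/2
      sample i from [d] based on $P[i] = |w_i| / \|\mathbf{w}\|_1$
      $\hat{y} = \hat{y} + \frac{2}{k}\, \mathrm{sgn}(w_i)\, \|\mathbf{w}\|_1\, x_i$
    end
    $\mathbf{w} = (1 - \frac{1}{t})\, \mathbf{w} - \frac{2}{\lambda t} (\hat{y} - y)\, \mathbf{v}$
    $\mathbf{w} = \operatorname{argmin}_{\mathbf{u} : \|\mathbf{u}\|_1 \le B} \|\mathbf{u} - \mathbf{w}\|_2$
    $\bar{\mathbf{w}} = \bar{\mathbf{w}} + \mathbf{w}/m$
    t = t + 1
  end

  Output: $\bar{\mathbf{w}}$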
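A minimal pure-Python sketch of the AER loop, intended only to make the steps of Algorithm 2 concrete. Several details are our assumptions rather than the paper's: the function names, the use of a sort-based $\ell_1$ projection, setting $\hat{y} = 0$ when $\mathbf{w} = 0$ (where the sampling distribution is undefined), and requiring k to be even. The simulation holds the full vector `x` but reads only the sampled coordinates.

```python
import random


def _sign(a):
    return (a > 0) - (a < 0)


def _project_l1(w, radius):
    """Sort-based Euclidean projection onto the l1 ball (radius > 0)."""
    if sum(abs(a) for a in w) <= radius:
        return list(w)
    u = sorted((abs(a) for a in w), reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        candidate = (cumulative - radius) / j
        if uj - candidate > 0:
            theta = candidate
    return [_sign(a) * max(abs(a) - theta, 0.0) for a in w]


def aer(examples, k, lam, B, seed=0):
    """Attribute Efficient Regression sketch; returns the averaged predictor."""
    rng = random.Random(seed)
    d, m, half = len(examples[0][0]), len(examples), k // 2
    w = [0.0] * d
    w_bar = [0.0] * d
    for t, (x, y) in enumerate(examples, start=1):
        # Estimate x from k/2 attributes drawn without replacement.
        v = [0.0] * d
        for j in rng.sample(range(d), half):
            v[j] += (d / half) * x[j]
        # Estimate <w, x> from k/2 attributes drawn with P[i] = |w_i| / ||w||_1.
        l1 = sum(abs(a) for a in w)
        y_hat = 0.0
        if l1 > 0:
            weights = [abs(a) for a in w]
            for i in rng.choices(range(d), weights=weights, k=half):
                y_hat += _sign(w[i]) * l1 * x[i] / half
        # Pegasos-style step with the estimated gradient, then projection.
        step = 2.0 / (lam * t)
        w = [(1.0 - 1.0 / t) * wi - step * (y_hat - y) * vi
             for wi, vi in zip(w, v)]
        w = _project_l1(w, B)
        w_bar = [wb + wi / m for wb, wi in zip(w_bar, w)]
    return w_bar


# Hypothetical toy data: targets generated by a sparse linear predictor.
_data_rng = random.Random(1)
_w_true = [0.5, -0.5, 0.0, 0.0]
toy = []
for _ in range(200):
    _x = [_data_rng.uniform(-1.0, 1.0) for _ in range(4)]
    toy.append((_x, sum(a * b for a, b in zip(_w_true, _x))))
w_bar = aer(toy, k=4, lam=1.0, B=1.0, seed=0)
```

Because each iterate is projected onto the $\ell_1$ ball of radius B and the output is their average, the returned predictor also lies in that ball.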



The following theorem provides convergence guaran- 
tees for AER. 

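Theorem 2 Let $\mathcal{D}$ be a distribution such that
$P[\mathbf{x} \in [-1,+1]^d \wedge y \in [-1,+1]] = 1$. Let $\mathbf{w}^\star$ be any vector
such that $\|\mathbf{w}^\star\|_1 \le B$ and $\|\mathbf{w}^\star\|_2 \le B_2$. Then,

$$ \mathbb{E}[L_{\mathcal{D}}(\bar{\mathbf{w}})] \le L_{\mathcal{D}}(\mathbf{w}^\star) + O\left( \frac{d\,(B+1)\,B_2}{\sqrt{k}} \sqrt{\frac{\ln(m)}{m}} \right), $$

where |S| = m, $\bar{\mathbf{w}}$ is the output of AER(S, k) run with
$\lambda = ((B+1)\,d / B_2) \sqrt{\log(m)/(mk)}$, and the expectation
is over the choice of S and over the algorithm's own
randomization.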

For simplicity and readability, in the above theorem we 
only bounded the expected risk. It is possible to obtain 
similar guarantees with high probability by relying on 
Azuma's inequality (see, for example, Cesa-Bianchi
et al., 2004).

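Note that $\|\mathbf{w}^\star\|_2 \le \|\mathbf{w}^\star\|_1 \le B$, so Theorem 2 implies
that

$$ \mathbb{E}[L_{\mathcal{D}}(\bar{\mathbf{w}})] \le \min_{\mathbf{w} : \|\mathbf{w}\|_1 \le B} L_{\mathcal{D}}(\mathbf{w}) + O\left( \frac{d B^2}{\sqrt{k}} \sqrt{\frac{\ln(m)}{m}} \right). $$

Therefore, the bound for AER is much better$^3$ than
the bound for Baseline: instead of $d^2/k$ we have $d/\sqrt{k}$.

3 When comparing bounds, we ignore logarithmic terms.
Also, in this discussion we assume that $B$ and $B_2$ are at
least 1.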

It is interesting to compare the bound for AER to the 
Lasso bound in the full information case given in (2). 
As can be seen, to achieve the same level of risk,
AER needs a factor of $d^2/k$ more examples than the
full information Lasso.$^4$ Since each AER example uses
only k attributes while each Lasso example uses all
d attributes, the ratio between the total number of
attributes AER needs and the number of attributes
Lasso needs to achieve the same error is O(d). Intuitively,
with d times as many attributes in total,
we can fully compensate for the partial information
protocol.
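The attribute count can be made explicit with a short worked calculation. Here $m'$ and $m$ are our own notation for the number of examples used by AER and by full information Lasso respectively, logarithmic terms are ignored, and $B_2 \le B$ is used to compare the two bounds:

```latex
\frac{d B^2}{\sqrt{k}} \sqrt{\frac{1}{m'}} \;=\; B^2 \sqrt{\frac{1}{m}}
\quad\Longrightarrow\quad
m' \;=\; \frac{d^2}{k}\, m ,
\qquad
\frac{\text{attributes seen by AER}}{\text{attributes seen by Lasso}}
\;=\; \frac{k\, m'}{d\, m}
\;=\; \frac{k}{d} \cdot \frac{d^2}{k}
\;=\; d .
```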

However, in some situations even this extra d factor
is not needed. Suppose we know that the vector
$\mathbf{w}^\star$, which minimizes the risk, is dense. That is, it
satisfies $\|\mathbf{w}^\star\|_1 \approx \sqrt{d}\, \|\mathbf{w}^\star\|_2$. In this case, choosing
$B_2 = B/\sqrt{d}$, the bound in Theorem 2 becomes order
of $B^2 \sqrt{d/k}\, \sqrt{1/m}$. Therefore, the number of examples
AER needs in order to achieve the same error as Lasso
is only a factor d/k more than the number of examples
Lasso uses. But this implies that both AER and
Lasso need the same number of attributes in order to
achieve the same level of error! Crucially, the above
holds only if $\mathbf{w}^\star$ is dense. When $\mathbf{w}^\star$ is sparse we have
$\|\mathbf{w}^\star\|_1 \approx \|\mathbf{w}^\star\|_2$ and then AER needs more attributes
than Lasso.

One might wonder whether a more clever active sam- 
pling strategy could attain in the sparse case the per- 
formance of Lasso while using the same number of at- 
tributes. The next subsection shows that this is not 
possible in general. 

2.3. Lower bounds and negative results 

We now show (proof in the appendix) that any attribute
efficient algorithm needs in general order of $d/\epsilon$
examples for learning an $\epsilon$-accurate sparse linear predictor.
Recall that the upper bound of AER implies
that order of $d^2 (B+1)^2 B_2^2 / \epsilon^2$ examples are sufficient
for learning a predictor with $L_{\mathcal{D}}(\mathbf{w}) - L_{\mathcal{D}}(\mathbf{w}^\star) < \epsilon$.
Specializing this sample complexity bound of AER
to the $\mathbf{w}^\star$ described in Theorem 3 below yields that
$O(d^2/\epsilon)$ examples are sufficient for AER for learning
a good predictor in this case. That is, we have a gap
of factor d between the lower bound and the upper
bound, and it remains open to bridge this gap.

Theorem 3 For any $\epsilon \in (0, 1/16)$, k, and $d \ge 4k$,
there exists a distribution over examples and a weight
vector $\mathbf{w}^\star$, with $\|\mathbf{w}^\star\|_0 = 1$ and $\|\mathbf{w}^\star\|_2 = \|\mathbf{w}^\star\|_1 = 2\sqrt{\epsilon}$,
such that any attribute efficient regression algorithm
accessing at most k attributes per training example
must see (in expectation) at least $\Omega\left(\frac{d}{k\epsilon}\right)$ examples
in order to learn a linear predictor $\mathbf{w}$ with
$L_{\mathcal{D}}(\mathbf{w}) - L_{\mathcal{D}}(\mathbf{w}^\star) < \epsilon$.

4 We note that when d = k we still do not recover the
full information bound. However, it is possible to improve
the analysis and replace the factor $d/\sqrt{k}$ with a factor
$d \max_t \|\mathbf{x}_t\|_2 / k$.

Recall that in our setting, while at training time the 
learner can only view k attributes of each example, at 
test time all attributes can be observed. The setting 
of Greiner et al. (2002), instead, assumes that at test 
time the learner cannot observe all the attributes. The 
following theorem shows that if a learner can view at 
most 2 attributes at test time then it is impossible to 
give accurate predictions at test time even when the 
optimal linear predictor is known. 

Theorem 4 There exists a weight vector $\mathbf{w}^\star$ and a
distribution $\mathcal{D}$ such that $L_{\mathcal{D}}(\mathbf{w}^\star) = 0$ while any algorithm
A that gives predictions $A(\mathbf{x})$ while viewing only
2 attributes of each $\mathbf{x}$ must have $L_{\mathcal{D}}(A) \ge 1/9$.

The proof is given in the appendix. This negative re- 
sult highlights an interesting phenomenon. We can 
learn an arbitrarily accurate predictor w from par- 
tially observed examples. However, even if we know 
the optimal w*, we might not be able to accurately 
predict a new partially observed example. 

3. Proof Sketch of Theorem 2 

Here we only sketch the proof of Theorem 2. A com- 
plete proof of all our theorems is given in the appendix. 

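We start with a general logarithmic regret bound for
strongly convex functions (Hazan et al., 2006; Kakade
and Shalev-Shwartz, 2008). The regret bound implies
the following. Let $\mathbf{z}_1, \ldots, \mathbf{z}_m$ be a sequence of vectors,
each of which has norm bounded by G. Let $\lambda > 0$ and
consider the sequence of functions $g_1, \ldots, g_m$ such that
$g_t(\mathbf{w}) = \frac{\lambda}{2} \|\mathbf{w}\|^2 + \langle \mathbf{z}_t, \mathbf{w} \rangle$. Each $g_t$ is $\lambda$-strongly convex
(meaning, it is not too flat), and therefore regret
bounds for strongly convex functions tell us that there
is a way to construct a sequence of vectors $\mathbf{w}_1, \ldots, \mathbf{w}_m$
such that for any $\mathbf{w}^\star$ that satisfies $\|\mathbf{w}^\star\|_1 \le B$ we have

$$ \frac{1}{m} \sum_{t=1}^{m} g_t(\mathbf{w}_t) - \frac{1}{m} \sum_{t=1}^{m} g_t(\mathbf{w}^\star) \le O\left( \frac{G^2 \log(m)}{\lambda\, m} \right). $$

With an appropriate choice of $\lambda$, and with the assumption
$\|\mathbf{w}^\star\|_2 \le B_2$, the above inequality implies that
$\frac{1}{m} \sum_{t=1}^{m} \langle \mathbf{z}_t, \mathbf{w}_t - \mathbf{w}^\star \rangle \le \alpha$, where $\alpha = O\left( G B_2 \sqrt{\frac{\log(m)}{m}} \right)$.

This holds for any sequence of $\mathbf{z}_1, \ldots, \mathbf{z}_m$, and in particular,
we can set $\mathbf{z}_t = 2(\hat{y}_t - y_t)\, \mathbf{v}_t$. Note that $\mathbf{z}_t$ is a
random vector that depends both on the value of $\mathbf{w}_t$
and on the random bits chosen on round t. Taking
conditional expectation of $\mathbf{z}_t$ w.r.t. the random bits
chosen on round t we obtain that $\mathbb{E}[\mathbf{z}_t \mid \mathbf{w}_t]$ is exactly
the gradient of $(\langle \mathbf{w}, \mathbf{x}_t \rangle - y_t)^2$ at $\mathbf{w}_t$, which we denote
by $\nabla_t$. From the convexity of the squared loss, we
can lower bound $\langle \nabla_t, \mathbf{w}_t - \mathbf{w}^\star \rangle$ by
$(\langle \mathbf{w}_t, \mathbf{x}_t \rangle - y_t)^2 - (\langle \mathbf{w}^\star, \mathbf{x}_t \rangle - y_t)^2$.
That is, in expectation we have that

$$ \mathbb{E}\left[ \frac{1}{m} \sum_{t=1}^{m} \left( (\langle \mathbf{w}_t, \mathbf{x}_t \rangle - y_t)^2 - (\langle \mathbf{w}^\star, \mathbf{x}_t \rangle - y_t)^2 \right) \right] \le \alpha. $$

Taking expectation w.r.t. the random choice of the
examples from $\mathcal{D}$, denoting $\bar{\mathbf{w}} = \frac{1}{m} \sum_{t=1}^{m} \mathbf{w}_t$, and using
Jensen's inequality we get that $\mathbb{E}[L_{\mathcal{D}}(\bar{\mathbf{w}})] \le L_{\mathcal{D}}(\mathbf{w}^\star) + \alpha$.
Finally, we need to make sure that $\alpha$ is not too
large. The only potential danger is that G, the bound
on the norms of $\mathbf{z}_1, \ldots, \mathbf{z}_m$, will be large. We make
sure this cannot happen by restricting each $\mathbf{w}_t$ to the
$\ell_1$ ball of radius B, which ensures that $\|\mathbf{z}_t\| \le O((B+1)\,d)$
for all t.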

4. Experiments 

We performed some preliminary experiments to test
the behavior of our algorithm on the well-known
MNIST digit recognition dataset (LeCun et al., 1998),
which contains 70,000 images (28 x 28 pixels each) of
the digits 0-9. The advantages of this dataset for
our purposes are that it is not a small-scale dataset, has
a reasonable dimensionality-to-data-size ratio, and the
setting is clearly interpretable graphically. While this
dataset is designed for classification (e.g. recognizing
the digit in the image), we can still apply our algorithms
to it by regressing to the label.

First, to demonstrate the hardness of our settings, we 
provide in Figure 1 below some examples of images 
from the dataset, in the full information setting and 
the partial information setting. The upper row con- 
tains six images from the dataset, as available to a 
full-information algorithm. A partial-information algorithm,
however, will have much more limited access
to these images. In particular, if the algorithm
may only choose k = 4 pixels from each image, the
same six images as available to it might look like the
bottom row of Figure 1.

We began by looking at a dataset composed of "3 vs.
5", where all the 3 digits were labeled as −1 and all
the 5 digits were labeled as +1. We ran four different
algorithms on this dataset: the simple Baseline
algorithm, AER, as well as ridge regression and Lasso
for comparison (for Lasso, we solved (1) with p = 1).
Both ridge regression and Lasso were run in the full
information setting: namely, they enjoyed full access to
all attributes of all examples in the training set. The
Baseline algorithm and AER, however, were given access
to only 4 attributes from each training example.

Figure 1. In the upper row six examples from the training
set (of digits 3 and 5) are shown. In the lower row we show
the same six examples, where only four randomly sampled
pixels from each original image are displayed.

We randomly split the dataset into a training set and 
a test set (with the test set being 10% of the origi- 
nal dataset). For each algorithm, parameter tuning 
was performed using 10-fold cross validation. Then, 
we ran the algorithm on increasingly long prefixes of 
the training set, and measured the average regression
error $(\langle \mathbf{w}, \mathbf{x} \rangle - y)^2$ on the test set. The results
(averaged over runs on 10 random train-test splits) are
presented in Figure 2. In the upper plot, we see how 
the test regression error improves with the number of 
examples. The Baseline algorithm is highly unstable 
at the beginning, probably due to the ill-conditioning 
of the estimated covariance matrix, although it even- 
tually stabilizes (to prevent a graphical mess at the 
left hand side of the figure, we removed the error bars 
from the corresponding plot). Its performance is worse 
than AER, completely in line with our earlier theoret- 
ical analysis. 

The bottom plot of Figure 2 is similar, except that
now the X-axis represents the cumulative number
of attributes seen by each algorithm rather than the
number of examples. For the partial-information algorithm,
the graph ends at approximately 49,000 attributes,
which is the total number of attributes accessed
by the algorithm after running over all training
examples, seeing k = 4 pixels from each example.
However, for the full-information algorithm 49,000 attributes
are already seen after just 62 examples. When
we compare the algorithms in this way, we see that
our AER algorithm achieves excellent performance for
a given attribute budget, significantly better than the
other $\ell_1$-based algorithms, and even comparable to
full-information ridge regression.

Figure 2. Test regression error for each of the 4 algorithms,
over increasing prefixes of the training set for "3 vs. 5". The
results are averaged over 10 runs.

Finally, we tested the algorithms over 45 datasets generated
from MNIST, one for each possible pair of digits.
For each dataset and each of 10 random train-test
splits, we performed parameter tuning for each algorithm
separately, and checked the average squared error
on the test set. The median test errors over all
datasets are presented in the table below.







                                   Test Error
Full Information       Ridge       0.110
                       Lasso       0.222
Partial Information    AER         0.320
                       Baseline    0.815



As can be seen, the AER algorithm manages to achieve 
good performance, not much worse than the full- 
information Lasso algorithm. The Baseline algorithm, 
however, achieves a substantially worse performance, 
in line with our theoretical analysis above. We also 
calculated the test classification error of AER, i.e.
$\mathrm{sign}(\langle \mathbf{w}, \mathbf{x} \rangle) \ne y$, and found that AER, which can
see only 4 pixels per image, usually performs only a little
worse than the full-information algorithms (ridge
regression and Lasso), which enjoy full access to all
784 pixels in each image. In particular, the median
test classification errors of AER, Lasso, and Ridge are
3.5%, 1.1%, and 1.3% respectively.

5. Discussion and Extensions 

In this paper, we provided an efficient algorithm for 
learning when only a few attributes from each train- 
ing example can be seen. The algorithm comes with 
formal guarantees, is provably competitive with algo- 
rithms which enjoy full access to the data, and seems 
to perform well in practice. We also presented sam- 
ple complexity lower bounds, which are only a factor 
d smaller than the upper bound achieved by our algo- 
rithm, and it remains open to bridge this gap. 

Our approach easily extends to other gradient-based
algorithms besides Pegasos, for example generalized
additive algorithms such as p-norm Perceptrons and
Winnow; see, e.g., (Cesa-Bianchi and Lugosi, 2006).

An obvious direction for future research is how to deal
with loss functions other than the squared loss. In
upcoming work on a related problem, we develop a technique
which allows us to deal with arbitrary analytic loss
functions, but which, in the setting of this paper, leads
to sample complexity bounds that are exponential in d.
Another interesting extension we are considering is
connecting our results to the field of privacy-preserving
learning (Dwork, 2008), where the goal is to exploit the
attribute efficiency property in order to prevent
acquisition of information about individual data instances.

A. Proofs

A.1. Proof of Theorem 1

To ease our calculations, we first show that sampling k elements without replacement and then averaging the
result has the same expectation as sampling just once. In the lemma below, for a set C we denote the uniform
distribution over C by U(C).

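Lemma 1 Let $C$ be a finite set and let $f : C \to \mathbb{R}$ be an arbitrary function. Let $C_k = \{C' \subseteq C : |C'| = k\}$.
Then,

\[
\mathbb{E}_{C' \sim U(C_k)}\Big[\frac{1}{k}\sum_{c \in C'} f(c)\Big] = \mathbb{E}_{c \sim U(C)}[f(c)] .
\]

Proof Denote $|C| = n$. We have:

\begin{align*}
\mathbb{E}_{C' \sim U(C_k)}\Big[\frac{1}{k}\sum_{c \in C'} f(c)\Big]
  &= \frac{1}{\binom{n}{k}} \sum_{C' \in C_k} \frac{1}{k} \sum_{c \in C'} f(c) \\
  &= \frac{1}{k\binom{n}{k}} \sum_{c \in C} f(c)\, \big|\{C' \in C_k : c \in C'\}\big| \\
  &= \frac{\binom{n-1}{k-1}}{k\binom{n}{k}} \sum_{c \in C} f(c) \\
  &= \frac{(n-1)!\,k!\,(n-k)!}{k\,n!\,(k-1)!\,(n-k)!} \sum_{c \in C} f(c) \\
  &= \frac{1}{n} \sum_{c \in C} f(c) = \mathbb{E}_{c \sim U(C)}[f(c)] .
\end{align*}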
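As a sanity check on Lemma 1, the following sketch enumerates all size-k subsets of a small set and compares the exact expectation of the subset average with the plain mean. The function names and the sample values are illustrative assumptions; exact rational arithmetic is used so the comparison involves no rounding.

```python
from fractions import Fraction
from itertools import combinations


def mean_of_subset_averages(values, k):
    """Exact E over uniform size-k subsets C' of (1/k) * sum_{c in C'} f(c).

    `values` lists the integers f(c) for the elements c of C.
    """
    subsets = list(combinations(range(len(values)), k))
    total = sum(Fraction(sum(values[i] for i in s), k) for s in subsets)
    return total / len(subsets)


def mean_single_draw(values):
    """Exact E[f(c)] for c drawn uniformly from C."""
    return Fraction(sum(values), len(values))


values = [3, -1, 4, 1, -5, 9]  # hypothetical values of f on a 6-element set
for k in range(1, len(values) + 1):
    assert mean_of_subset_averages(values, k) == mean_single_draw(values)
```

Each element belongs to the same number of size-k subsets, which is why the two expectations agree for every k.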



To prove Theorem 1 we first show that the estimation matrix constructed by the Baseline algorithm is likely to 
be close to the true correlation matrix over the training set. 

Lemma 2 Let $A_t$ be the matrix constructed at iteration $t$ of the Baseline algorithm and note that $\bar{A} = \frac{1}{m}\sum_{t=1}^{m} A_t$.
Let $\bar{X} = \frac{1}{m}\sum_{t=1}^{m} x_t x_t^\top$. Then, with probability of at least $1 - \delta$ over the algorithm's own randomness we have
that

\[
\forall r, s, \quad |\bar{A}_{r,s} - \bar{X}_{r,s}| \le \frac{d^2}{k}\sqrt{\frac{2\ln(2d^2/\delta)}{m}} .
\]

Proof Based on Lemma 1, it is easy to verify that $\mathbb{E}[A_t] = x_t x_t^\top$. Additionally, since we sample without
replacement, each element of $A_t$ is in $[-d^2/k, d^2/k]$ (because we assume $\|x_t\|_\infty \le 1$). Therefore, we can apply
Hoeffding's inequality on each element of $\bar{A}$ and obtain that

\[
\mathbb{P}\big[|\bar{A}_{r,s} - \bar{X}_{r,s}| > \epsilon\big] \le 2e^{-mk^2\epsilon^2/(2d^4)} .
\]

Combining the above with the union bound we obtain that

\[
\mathbb{P}\big[\exists (r,s) : |\bar{A}_{r,s} - \bar{X}_{r,s}| > \epsilon\big] \le 2d^2 e^{-mk^2\epsilon^2/(2d^4)} .
\]

Calling the right-hand side of the above $\delta$ and rearranging terms we conclude our proof. ■
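The "rearranging terms" step can be checked numerically. The sketch below evaluates the deviation bound of Lemma 2 and plugs it back into the union-bound expression, which should return the chosen confidence level. The function names and the parameter values are illustrative assumptions.

```python
import math


def lemma2_deviation(d, k, m, delta):
    """Deviation (d^2 / k) * sqrt(2 ln(2 d^2 / delta) / m) from Lemma 2."""
    return (d ** 2 / k) * math.sqrt(2 * math.log(2 * d ** 2 / delta) / m)


def union_bound_failure_prob(d, k, m, eps):
    """Union bound 2 d^2 exp(-m k^2 eps^2 / (2 d^4)) over all d^2 entries."""
    return 2 * d ** 2 * math.exp(-m * k ** 2 * eps ** 2 / (2 * d ** 4))


# Hypothetical setting: d = 10 attributes, k = 4 observed, m = 10000 examples.
eps = lemma2_deviation(d=10, k=4, m=10_000, delta=0.05)
print(eps, union_bound_failure_prob(d=10, k=4, m=10_000, eps=eps))
```

The deviation scales as $d^2/k$, which is the source of the $d^2$ factor in the Baseline guarantee.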



Next, we show that the estimate of the linear part of the objective function is also likely to be accurate.

Lemma 3 Let $v_t$ be the vector constructed at iteration $t$ of the Baseline algorithm and note that $\bar{v} = \frac{1}{m}\sum_{t=1}^{m} 2 y_t v_t$.
Let $\bar{x} = \frac{1}{m}\sum_{t=1}^{m} 2 y_t x_t$. Then, with probability of at least $1 - \delta$ over the algorithm's own randomness
we have that

\[
\forall r, \quad |\bar{v}_r - \bar{x}_r| \le \frac{d}{k}\sqrt{\frac{8\ln(2d/\delta)}{m}} .
\]

Proof Based on Lemma 1, it is easy to verify that $\mathbb{E}[2 y_t v_t] = 2 y_t x_t$. Additionally, since we sample $k/2$ pairs
without replacement, each element of $v_t$ is in $[-2d/k, 2d/k]$ (because we assume $\|x_t\|_\infty \le 1$) and thus each
element of $2 y_t v_t$ is in $[-4d/k, 4d/k]$ (because we assume that $|y_t| \le 1$). Therefore, we can apply Hoeffding's
inequality on each element of $\bar{v}$ and obtain that

\[
\mathbb{P}\big[|\bar{v}_r - \bar{x}_r| > \epsilon\big] \le 2e^{-mk^2\epsilon^2/(8d^2)} .
\]

Combining the above with the union bound we obtain that

\[
\mathbb{P}\big[\exists r : |\bar{v}_r - \bar{x}_r| > \epsilon\big] \le 2d\,e^{-mk^2\epsilon^2/(8d^2)} .
\]

Calling the right-hand side of the above $\delta$ and rearranging terms we conclude our proof. ■



We next show that the estimated training loss found by the Baseline algorithm, $\tilde{L}_S(w)$, is close to the true
training loss $L_S(w)$.

Lemma 4 With probability greater than $1 - \delta$ over the Baseline algorithm's own randomization, for all $w$ such
that $\|w\|_1 \le B$ we have that

\[
|\tilde{L}_S(w) - L_S(w)| \le O\Big(\frac{B^2 d^2}{k}\sqrt{\frac{\ln(d/\delta)}{m}}\Big) .
\]

Proof Combining Lemma 2 with the boundedness of $\|w\|_1$ and using Hölder's inequality twice we easily get
that

\[
|w^\top(\bar{A} - \bar{X})w| \le \frac{B^2 d^2}{k}\sqrt{\frac{2\ln(2d^2/\delta)}{m}} .
\]

Similarly, using Lemma 3 and Hölder's inequality,

\[
|w^\top(\bar{v} - \bar{x})| \le \frac{B d}{k}\sqrt{\frac{8\ln(2d/\delta)}{m}} .
\]

Combining the above inequalities with the union bound and the triangle inequality we conclude our proof.



We are now ready to prove Theorem 1. First, using standard risk bounds (based on Rademacher complexities⁵)
we know that with probability greater than $1 - \delta$ over the choice of a training set of $m$ examples, for all $w$ s.t.
$\|w\|_1 \le B$, we have that

\[
|L_S(w) - L_{\mathcal{D}}(w)| \le O\Big(B^2\sqrt{\frac{\ln(d/\delta)}{m}}\Big) .
\]

Combining the above with Lemma 4 we obtain that for any $w$ s.t. $\|w\|_1 \le B$,

\[
|L_{\mathcal{D}}(w) - \tilde{L}_S(w)| \le |L_{\mathcal{D}}(w) - L_S(w)| + |L_S(w) - \tilde{L}_S(w)|
\le O\Big(\frac{B^2 d^2}{k}\sqrt{\frac{\ln(d/\delta)}{m}}\Big) .
\]

The proof of Theorem 1 follows since the Baseline algorithm minimizes $\tilde{L}_S(w)$.

⁵ To bound the Rademacher complexity, we use the boundedness of $\|w\|_1$, $\|x\|_\infty$, $|y|$ to get that the squared loss is $O(B)$-Lipschitz
on the domain. Combining this with the contraction principle yields the desired Rademacher bound.

A.2. Proof of Theorem 2

We start with the following lemma.

Lemma 5 Let $y_t, \hat{y}_t, v_t, w_t$ be the values of $y, \hat{y}, v, w$, respectively, at iteration $t$ of the AER algorithm. Then,
for any vector $w^\star$ s.t. $\|w^\star\|_1 \le B$ we have

\[
\sum_{t=1}^{m}\Big(\frac{\lambda}{2}\|w_t\|_2^2 + 2(\hat{y}_t - y_t)\langle v_t, w_t\rangle\Big)
\le \sum_{t=1}^{m}\Big(\frac{\lambda}{2}\|w^\star\|_2^2 + 2(\hat{y}_t - y_t)\langle v_t, w^\star\rangle\Big)
+ O\Big(\frac{((B+1)d)^2\log(m)}{\lambda k}\Big) .
\]

Proof The proof follows directly from logarithmic regret bounds for strongly convex functions
(Hazan et al., 2006; Kakade and Shalev-Shwartz, 2008) by noting that according to our construction,
$\max_t 2|\hat{y}_t - y_t|\,\|v_t\|_2 \le O((B+1)\,d/\sqrt{k})$. ■



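Let $B_2$ be such that $\|w^\star\|_2 \le B_2$ and choose $\lambda = \frac{(B+1)d}{B_2}\sqrt{\frac{\log(m)}{mk}}$. Since $\lambda\|w_t\|_2^2 \ge 0$ we obtain
from Lemma 5 that

\[
\sum_{t=1}^{m} 2(\hat{y}_t - y_t)\langle v_t, w_t - w^\star\rangle
\le \frac{m\lambda}{2}\|w^\star\|_2^2 + O\Big(\frac{((B+1)d)^2\log(m)}{\lambda k}\Big)
= O\Big((B+1)\,d\,B_2\sqrt{\frac{m\log(m)}{k}}\Big) . \tag{7}
\]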
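The choice of $\lambda$ is the one that balances the regularization term against the logarithmic regret term. The sketch below evaluates both terms at this $\lambda$; the constant hidden in the $O(\cdot)$ is set to 1, which is an assumption made only for illustration, as are the function names and the parameter values.

```python
import math


def aer_lambda(B, B2, d, k, m):
    """lambda = ((B + 1) d / B2) * sqrt(log(m) / (m k)), as chosen above."""
    return ((B + 1) * d / B2) * math.sqrt(math.log(m) / (m * k))


def regret_terms(B, B2, d, k, m):
    """Return (regularization term, log-regret term, target scale).

    Assumption: the O(.) constant of the log-regret term is taken to be 1.
    """
    lam = aer_lambda(B, B2, d, k, m)
    regularization = m * lam / 2 * B2 ** 2
    log_regret = ((B + 1) * d) ** 2 * math.log(m) / (lam * k)
    scale = (B + 1) * d * B2 * math.sqrt(m * math.log(m) / k)
    return regularization, log_regret, scale


# Hypothetical setting loosely inspired by the digit data (d = 784, k = 4).
reg, log_reg, scale = regret_terms(B=1.0, B2=1.0, d=784, k=4, m=10_000)
print(reg / scale, log_reg / scale)
```

Both terms are constant multiples of $(B+1)\,d\,B_2\sqrt{m\log(m)/k}$, which is the order stated in (7).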



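For each $t$, let $\nabla_t = 2(\langle w_t, x_t\rangle - y_t)x_t$ and $\tilde{\nabla}_t = 2(\hat{y}_t - y_t)v_t$. Taking expectation of (7) with respect to the
algorithm's own randomization, and noting that the conditional expectation of $\tilde{\nabla}_t$ equals $\nabla_t$, we obtain

\[
\mathbb{E}\Big[\sum_{t=1}^{m}\langle \nabla_t, w_t - w^\star\rangle\Big] \le \alpha , \tag{8}
\]

where $\alpha$ denotes the bound on the right-hand side of (7).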



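From the convexity of the squared loss we know that

\[
(\langle w_t, x_t\rangle - y_t)^2 - (\langle w^\star, x_t\rangle - y_t)^2 \le \langle \nabla_t, w_t - w^\star\rangle .
\]

Combining with (8) yields

\[
\mathbb{E}\Big[\sum_{t=1}^{m}(\langle w_t, x_t\rangle - y_t)^2 - (\langle w^\star, x_t\rangle - y_t)^2\Big] \le \alpha . \tag{9}
\]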



Taking expectation again, this time with respect to the randomness in choosing the training set, and using the
fact that $w_t$ only depends on previous examples in the training set, we obtain that

\[
\mathbb{E}\Big[\sum_{t=1}^{m} L_{\mathcal{D}}(w_t) - L_{\mathcal{D}}(w^\star)\Big] \le \alpha . \tag{10}
\]

Finally, from Jensen's inequality we know that $\mathbb{E}\big[\frac{1}{m}\sum_{t=1}^{m} L_{\mathcal{D}}(w_t)\big] \ge \mathbb{E}[L_{\mathcal{D}}(\bar{w})]$ and this concludes our proof.



A.3. Proof of Theorem 3 

The outline of the proof is as follows. We define a specific distribution such that only one "good" feature is
slightly correlated with the label. We then show that if some algorithm learns a linear predictor with an extra
risk of at most $\epsilon$, then it must know the value of the "good" feature. Next, we construct a variant of a multi-armed
bandit problem out of our distribution and show that a good learner can yield a good prediction strategy. Finally,
we adapt a lower bound for the multi-armed bandit problem given in (Auer et al., 2003), to conclude that in our
case no learner can be too good.

The distribution: We generate a joint distribution over $\mathbb{R}^d \times \mathbb{R}$ as follows. Choose some $j \in [d]$. First, each
feature is generated i.i.d. according to $\mathbb{P}[x_i = 1] = \mathbb{P}[x_i = -1] = \frac{1}{2}$. Next, given $x$ and $j$, $y$ is generated according
to $\mathbb{P}[y = x_j] = \frac{1}{2} + p$ and $\mathbb{P}[y = -x_j] = \frac{1}{2} - p$, where $p$ is set to be $\sqrt{\epsilon}$. Denote by $P_j$ the distribution mentioned
above assuming the "good" feature is $j$. Also denote by $P_u$ the uniform distribution over $\{\pm 1\}^{d+1}$. Analogously,
we denote by $\mathbb{E}_j$ and $\mathbb{E}_u$ expectations w.r.t. $P_j$ and $P_u$.

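A good regressor "knows" j: We now show that if we have a good linear regressor then we can know the
value of $j$. The optimal linear predictor is $w^\star = 2p\,e^j$, where $e^j$ is the $j$-th standard basis vector, and the risk of $w^\star$ is

\[
L_{\mathcal{D}}(w^\star) = \mathbb{E}[(\langle w^\star, x\rangle - y)^2] = (\tfrac{1}{2} + p)(1 - 2p)^2 + (\tfrac{1}{2} - p)(1 + 2p)^2 = 1 + 4p^2 - 8p^2 = 1 - 4p^2 .
\]

The risk of an arbitrary weight vector under the aforementioned distribution is:

\[
L_{\mathcal{D}}(w) = \mathbb{E}_j[(\langle w, x\rangle - y)^2] = \sum_{i \ne j} w_i^2 + \mathbb{E}[(w_j x_j - y)^2] = \sum_{i \ne j} w_i^2 + w_j^2 + 1 - 4p\,w_j . \tag{11}
\]

Suppose that $L_{\mathcal{D}}(w) - L_{\mathcal{D}}(w^\star) < \epsilon$. This implies that:

1. For all $i \ne j$ we have $w_i^2 < \epsilon$, or equivalently, $|w_i| < \sqrt{\epsilon}$.

2. $1 + w_j^2 - 4p\,w_j - (1 - 4p^2) < \epsilon$ and thus $|w_j - 2p| < \sqrt{\epsilon}$ which gives $|w_j| > 2p - \sqrt{\epsilon}$.

Since we set $p = \sqrt{\epsilon}$, the above gives $|w_j| > \sqrt{\epsilon} > |w_i|$ for all $i \ne j$, so we can identify the value of $j$ (as the
coordinate of largest magnitude) from any $w$ whose risk is strictly smaller than $L_{\mathcal{D}}(w^\star) + \epsilon$.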
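The risk formulas above can be verified by brute force on a small instance. The sketch below enumerates the distribution $P_j$ exactly, using rational arithmetic, and compares the result with $1 - 4p^2$ and with the closed form (11). The function name, the dimension d = 3, and the particular values of p and w are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def risk(w, j, p):
    """Exact L_D(w) = E[(<w, x> - y)^2] under P_j, by full enumeration.

    x is uniform on {-1, +1}^d; y = x_j w.p. 1/2 + p and y = -x_j otherwise.
    """
    d = len(w)
    half = Fraction(1, 2)
    total = Fraction(0)
    for x in product((-1, 1), repeat=d):
        dot = sum(wi * xi for wi, xi in zip(w, x))
        for y, prob_y in ((x[j], half + p), (-x[j], half - p)):
            total += Fraction(1, 2 ** d) * prob_y * (dot - y) ** 2
    return total


p, j = Fraction(1, 5), 1  # hypothetical correlation p and "good" index j
w_star = [Fraction(0), 2 * p, Fraction(0)]
assert risk(w_star, j, p) == 1 - 4 * p ** 2

w = [Fraction(1, 10), Fraction(3, 10), Fraction(-1, 5)]
closed_form = sum(wi ** 2 for wi in w) + 1 - 4 * p * w[j]  # equation (11)
assert risk(w, j, p) == closed_form
```

The excess risk equals $\sum_{i \ne j} w_i^2 + (w_j - 2p)^2$, which is what makes the two numbered implications immediate.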

Constructing a variant of a multi-armed bandit problem: We now construct a variant of the multi-armed
bandit problem out of the distribution $P_j$. Each $i \in [d]$ is an arm and the reward of pulling $i$ is $\frac{1}{2}|x_i + y| \in \{0, 1\}$.
Unlike standard multi-armed bandit problems, here at each round the learner chooses $K$ arms $a_{t,1}, \ldots, a_{t,K}$,
which correspond to the $K$ attributes accessed at round $t$, and his reward is defined to be the average of the
rewards of the chosen arms. At the end of each round the learner observes the value of $x_t$ at $a_{t,1}, \ldots, a_{t,K}$, as
well as the value of $y_t$. Note that the expected reward is $\frac{1}{2} + p\,\frac{1}{K}\sum_{i=1}^{K} \mathbb{1}[a_{t,i} = j]$. Therefore, the total expected
reward of an algorithm that runs for $T$ rounds is upper bounded by $\frac{1}{2}T + p\,\mathbb{E}[N_j]$, where $N_j$ is the number of
times $j \in \{a_{t,1}, \ldots, a_{t,K}\}$.
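The expected per-round reward stated above can be checked by exact enumeration on a small instance. The sketch below averages the rewards $\frac{1}{2}|x_i + y|$ of the pulled arms under $P_j$. The function name and the chosen arms, dimension, and value of p are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def expected_round_reward(arms, j, p, d):
    """Exact expected average reward of one round under P_j.

    arms: the K attribute indices a_{t,1}, ..., a_{t,K} pulled in the round
    (repetitions allowed); the reward of arm i is |x_i + y| / 2.
    """
    half = Fraction(1, 2)
    total = Fraction(0)
    for x in product((-1, 1), repeat=d):
        for y, prob_y in ((x[j], half + p), (-x[j], half - p)):
            rewards = [Fraction(abs(x[i] + y), 2) for i in arms]
            total += Fraction(1, 2 ** d) * prob_y * sum(rewards) / len(arms)
    return total


p = Fraction(1, 5)
# Two of the three pulls hit the good arm j = 2: expect 1/2 + p * (2/3).
assert expected_round_reward((0, 2, 2), j=2, p=p, d=3) == Fraction(1, 2) + p * Fraction(2, 3)
# No pull hits the good arm: expect exactly 1/2.
assert expected_round_reward((0, 1), j=2, p=p, d=3) == Fraction(1, 2)
```

Only pulls of the good arm contribute above the baseline of 1/2, which is why the total reward is controlled by $N_j$.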

A good learner yields a strategy: Suppose that we have a learner that can learn a linear predictor with
$L_{\mathcal{D}}(w) - L_{\mathcal{D}}(w^\star) < \epsilon$ using $m$ examples (on average). Since we have shown that once $L_{\mathcal{D}}(w) - L_{\mathcal{D}}(w^\star) < \epsilon$ we
know the value of $j$, we can construct a strategy for the multi-armed bandit problem in a straightforward way:
simply use the first $m$ examples to learn $w$ and from then on always pull the arm $j$, namely, $a_{t,1} = \ldots = a_{t,K} = j$.
The expected reward of this algorithm is at least

\[
\tfrac{1}{2}m + (T - m)(\tfrac{1}{2} + p) = \tfrac{1}{2}T + (T - m)p .
\]

An upper bound on the reward of any strategy: Consider an arbitrary prediction algorithm. At round $t$
the algorithm uses the history (and its own random bits, which we can assume are set in advance) to ask for the
current $K$ attributes $a_{t,1}, \ldots, a_{t,K}$. The history is the value of $x_s$ at $a_{s,1}, \ldots, a_{s,K}$ as well as the value of $y_s$, for
all $s < t$. That is, we can denote the history at round $t$ to be $r^t = (r_{1,1}, \ldots, r_{1,K+1}), \ldots, (r_{t-1,1}, \ldots, r_{t-1,K+1})$.

Therefore, on round $t$ the algorithm uses a mapping from $r^t$ to $[d]^K$. We use $r$ as a shorthand for $r^{T+1}$. The
following lemma shows that any function of the history cannot distinguish too well between the distribution $P_j$
and the uniform distribution.

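Lemma 6 Let $f : \{-1, 1\}^{(K+1)T} \to [0, M]$ be any function defined on a history sequence $r =
(r_{1,1}, \ldots, r_{1,K+1}), \ldots, (r_{T,1}, \ldots, r_{T,K+1})$. Let $N_j$ be the number of times the algorithm calculating $f$ picks
action $j$ among the selected arms. Then,

\[
\mathbb{E}_j[f(r)] \le \mathbb{E}_u[f(r)] + M\sqrt{-\log(1 - 4p^2)\,\mathbb{E}_u[N_j]} .
\]

Proof For any two distributions $P, Q$ we let $\|P - Q\|_1 = \sum_r |P[r] - Q[r]|$ be the total variation distance and
let $\mathrm{KL}(P, Q) = \sum_r P[r]\log(P[r]/Q[r])$ be the KL divergence. Using Hölder's inequality we know that $\mathbb{E}_j[f(r)] -
\mathbb{E}_u[f(r)] \le M\|P_j - P_u\|_1$. Additionally, using Pinsker's inequality we have $\frac{1}{2}\|P_j - P_u\|_1^2 \le \mathrm{KL}(P_u, P_j)$. Finally,
writing $r_t = (r_{t,1}, \ldots, r_{t,K+1})$ for the observations of round $t$, the chain rule and simple calculations yield,

\begin{align*}
\mathrm{KL}(P_u, P_j)
  &= \sum_r \big(\tfrac{1}{2}\big)^{(K+1)T} \sum_{t=1}^{T} \log\frac{P_u[r_t \mid r_1, \ldots, r_{t-1}]}{P_j[r_t \mid r_1, \ldots, r_{t-1}]} \\
  &= \sum_r \big(\tfrac{1}{2}\big)^{(K+1)T} \sum_{t=1}^{T} \log\frac{(\frac{1}{2})^{K+1}}{(\frac{1}{2})^{K+1} + (\frac{1}{2})^{K}\, p\, \mathbb{1}\big[\bigvee_{i=1}^{K}(a_{t,i} = j)\big]\, \mathrm{sgn}(x_{t,j} y_t)} \\
  &= \sum_{t=1}^{T} \mathbb{E}_u\Big[-\mathbb{1}\big[\textstyle\bigvee_{i=1}^{K}(a_{t,i} = j)\big] \log\big(1 + 2p\,\mathrm{sgn}(x_{t,j} y_t)\big)\Big] \\
  &= \sum_{t=1}^{T} \mathbb{E}_u\Big[\mathbb{1}\big[\textstyle\bigvee_{i=1}^{K}(a_{t,i} = j)\big]\Big]\, \mathbb{E}_u\Big[-\log\big(1 + 2p\,\mathrm{sgn}(x_{t,j} y_t)\big)\Big]
     \quad \text{(since $x_{t,j} y_t$ is independent of $a_{t,1}, \ldots, a_{t,K}$)} \\
  &= \Big(\tfrac{1}{2}\big(-\log(1 + 2p)\big) + \tfrac{1}{2}\big(-\log(1 - 2p)\big)\Big) \sum_{t=1}^{T} P_u\Big[\textstyle\bigvee_{i=1}^{K}(a_{t,i} = j)\Big] \\
  &= -\tfrac{1}{2}\log(1 - 4p^2)\,\mathbb{E}_u[N_j] .
\end{align*}

Combining all the above we conclude our proof.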
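The last two steps of the calculation rest on a one-round identity that is easy to confirm numerically: the KL divergence between a fair sign and a sign with bias p equals $-\frac{1}{2}\log(1 - 4p^2)$. The sketch below checks this identity and evaluates the gap term of Lemma 6. The function names and the parameter values are illustrative assumptions.

```python
import math


def per_round_kl(p):
    """KL between a uniform sign s and a biased sign with P[s] = 1/2 + p*s."""
    return sum(0.5 * math.log(0.5 / (0.5 + p * s)) for s in (-1, 1))


def closed_form_kl(p):
    """Closed form -(1/2) log(1 - 4 p^2) of the same quantity."""
    return -0.5 * math.log(1 - 4 * p ** 2)


def lemma6_gap(M, p, expected_pulls):
    """Gap term M * sqrt(-log(1 - 4 p^2) * E_u[N_j]) from Lemma 6."""
    return M * math.sqrt(-math.log(1 - 4 * p ** 2) * expected_pulls)


for p in (0.01, 0.1, 0.25):
    assert math.isclose(per_round_kl(p), closed_form_kl(p), rel_tol=1e-12)

# Hypothetical numbers: f bounded by M = 100, p = 0.1, E_u[N_j] = 5.
print(lemma6_gap(M=100, p=0.1, expected_pulls=5))
```

For small p the per-round divergence behaves like $2p^2$, so distinguishing $P_j$ from $P_u$ requires on the order of $1/p^2$ pulls of arm j.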



We have shown previously that the expected reward of any algorithm is bounded above by $\frac{1}{2}T + p\,\mathbb{E}_j[N_j]$.
Applying Lemma 6 above on $f(r) = N_j \in \{0, 1, \ldots, T\}$ we get that

\[
\mathbb{E}_j[N_j] \le \mathbb{E}_u[N_j] + T\sqrt{-\log(1 - 4p^2)\,\mathbb{E}_u[N_j]} .
\]

Therefore, the expected reward of any algorithm is at most

\[
\tfrac{1}{2}T + p\Big(\mathbb{E}_u[N_j] + T\sqrt{-\log(1 - 4p^2)\,\mathbb{E}_u[N_j]}\Big) .
\]

Since the adversary will choose $j$ to minimize the above and since the minimum over $j$ is smaller than the
expectation over choosing $j$ uniformly at random we have that the reward against an adversarial choice of $j$ is
at most

\[
\tfrac{1}{2}T + p\,\frac{1}{d}\sum_{j=1}^{d}\Big(\mathbb{E}_u[N_j] + T\sqrt{-\log(1 - 4p^2)\,\mathbb{E}_u[N_j]}\Big) . \tag{12}
\]

Note that

\[
\frac{1}{d}\sum_{j=1}^{d}\mathbb{E}_u[N_j] = \frac{1}{d}\,\mathbb{E}_u[N_1 + \ldots + N_d] \le \frac{KT}{d} .
\]

Combining this with (12) and using Jensen's inequality we obtain the following upper bound on the reward

\[
\tfrac{1}{2}T + p\Big(\tfrac{K}{d}T + T\sqrt{-\log(1 - 4p^2)\,\tfrac{K}{d}T}\Big) .
\]

Assuming that $\epsilon \le 1/16$ we have that $4p^2 = 4\epsilon \le 1/4$ and thus using the inequality $-\log(1 - q) \le \frac{3}{2}q$, which
holds for $q \in [0, 1/4]$, we get the upper bound

\[
\tfrac{1}{2}T + p\Big(\tfrac{K}{d}T + T\sqrt{6p^2\,\tfrac{K}{d}T}\Big) . \tag{13}
\]

Concluding the proof: Take a learning algorithm that finds an $\epsilon$-good predictor using $m$ examples. Since
the reward of the strategy based on this learning algorithm cannot exceed the upper bound given in (13) we
obtain that:

\[
\tfrac{1}{2}T + (T - m)p \le \tfrac{1}{2}T + p\Big(\tfrac{K}{d}T + T\sqrt{6p^2\,\tfrac{K}{d}T}\Big)
\]

which solved for $m$ gives

\[
m \ge T\Big(1 - \tfrac{K}{d} - \sqrt{6p^2\,\tfrac{K}{d}T}\Big) .
\]

Since we assume $d \ge 4K$, choosing $T = \lfloor d/(96Kp^2) \rfloor$, and recalling $p^2 = \epsilon$, gives

\[
m \ge \frac{T}{2} = \frac{1}{2}\Big\lfloor \frac{d}{96K\epsilon} \Big\rfloor .
\]
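The arithmetic of the concluding step can be traced with a short sketch. It checks the inequality $-\log(1 - q) \le \frac{3}{2}q$ on a grid over $[0, 1/4]$, computes $T = \lfloor d/(96K\epsilon) \rfloor$, and confirms that the bracketed factor is at least 1/2. The function names and the example values (d = 784, K = 4, $\epsilon$ = 0.01) are illustrative assumptions.

```python
import math


def log_inequality_holds(q):
    """Check -log(1 - q) <= (3/2) q for a single q in [0, 1/4]."""
    return -math.log(1 - q) <= 1.5 * q


def lower_bound_terms(d, K, eps):
    """Return (T, bracket) with T = floor(d / (96 K eps)) and
    bracket = 1 - K/d - sqrt(6 eps (K/d) T), so that m >= T * bracket."""
    if d < 4 * K or not (0 < eps <= 1 / 16):
        raise ValueError("the argument assumes d >= 4K and 0 < eps <= 1/16")
    T = math.floor(d / (96 * K * eps))
    bracket = 1 - K / d - math.sqrt(6 * eps * (K / d) * T)
    return T, bracket


assert all(log_inequality_holds(i / 400) for i in range(101))

T, bracket = lower_bound_terms(d=784, K=4, eps=0.01)
assert bracket >= 0.5  # hence m >= T / 2
print(T, bracket, T / 2)
```

With $K/d \le 1/4$ and $6p^2\frac{K}{d}T \le 1/16$, the bracket is at least $1 - \frac{1}{4} - \frac{1}{4} = \frac{1}{2}$, which is where the factor 96 comes from.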



A.4. Proof of Theorem 4

Let $w^\star = (1/3, 1/3, 1/3)$. Let $x \in \{\pm 1\}^3$ be distributed uniformly at random and let $y$ be determined deterministically
to be $\langle w^\star, x\rangle$. Then, $L_{\mathcal{D}}(w^\star) = 0$. However, any algorithm that only views 2 attributes has an uncertainty
about the label of at least $\pm\frac{1}{3}$, and therefore its expected squared error is at least 1/9. Formally, suppose the
algorithm asks for the first two attributes and forms its prediction to be $\hat{y}$. Since the generation of attributes is
independent, we have that the value of $x_3$ does not depend on $x_1$, $x_2$, and $\hat{y}$, and therefore

\[
\mathbb{E}[(\hat{y} - \langle w^\star, x\rangle)^2] = \mathbb{E}[(\hat{y} - w^\star_1 x_1 - w^\star_2 x_2 - w^\star_3 x_3)^2] = \mathbb{E}[(\hat{y} - w^\star_1 x_1 - w^\star_2 x_2)^2] + \mathbb{E}[(w^\star_3 x_3)^2] \ge 0 + (1/3)^2\,\mathbb{E}[x_3^2] = 1/9 ,
\]

which concludes our proof.
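The 1/9 bound can be confirmed by exhaustive enumeration. The sketch below computes the exact expected squared error of any deterministic predictor that sees only $x_1$ and $x_2$; the best such predictor, $(x_1 + x_2)/3$, attains exactly 1/9. The function name and the example predictors are illustrative assumptions, and randomized predictors are not enumerated here (their error is an average of deterministic ones).

```python
from fractions import Fraction
from itertools import product


def expected_sq_error(predict):
    """Exact E[(y_hat - <w*, x>)^2] for w* = (1/3, 1/3, 1/3), x uniform on {-1, +1}^3.

    predict: function (x1, x2) -> prediction, i.e. it sees only two attributes.
    """
    total = Fraction(0)
    for x in product((-1, 1), repeat=3):
        y = Fraction(sum(x), 3)  # y = <w*, x>
        total += Fraction(1, 8) * (predict(x[0], x[1]) - y) ** 2
    return total


# The conditional mean of y given (x1, x2) attains the bound exactly.
assert expected_sq_error(lambda a, b: Fraction(a + b, 3)) == Fraction(1, 9)
# Any other choice, e.g. always predicting 0, does worse.
assert expected_sq_error(lambda a, b: Fraction(0)) == Fraction(1, 3)
```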



